Predicting wind energy production

Francesco Mora - 100439601
Jose Maria Martínez Marín - 100443343

Nowadays, the electricity networks of advanced countries rely more and more on non-operable renewable energy sources, mainly wind and solar. However, in order to integrate these energy sources into the electricity network, the amount of energy to be generated must be forecasted 24 hours in advance, so that the energy plants connected to the network can be planned and prepared to meet supply and demand during the next day (for more details, check “Electricity market” on Wikipedia).
This is not an issue for traditional energy sources (gas, oil, hydropower, …) because their output can be adjusted at will (by burning more gas, for example). But solar and wind energy are not under the control of the energy operator (i.e. they are non-operable), because they depend on the weather. Therefore, they must be forecasted with high accuracy. This can be achieved to some extent through accurate weather forecasts. The Global Forecast System (GFS, USA) and the European Centre for Medium-Range Weather Forecasts (ECMWF) are two of the most important Numerical Weather Prediction (NWP) models for this purpose.

Yet, although NWPs are very good at accurately predicting variables like the “100-metre U wind component”, related to wind speed, the relation between those variables and the electricity actually produced is not straightforward. Machine Learning models can be used for this task. In particular, we are going to use meteorological variables forecasted by ECMWF (http://www.ecmwf.int/) as input attributes to a machine learning model that estimates how much energy is going to be produced at the Sotavento experimental wind farm (http://www.sotaventogalicia.com/en).
More concretely, we intend to train a machine learning model f, so that:

• Given the 00:00am ECMWF forecast for variables A6:00, B6:00, C6:00, … at 6:00 am (i.e. six hours in advance)
• f(A6:00, B6:00, C6:00, …) = electricity generated at Sotavento at 6:00

We will assume that we are not experts on wind energy generation (not too far away from the truth, actually). This means we are not sure which meteorological variables are the most relevant, so we will use many of them, and let the machine learning models and attribute selection algorithms select the relevant ones. Specifically, 22 variables will be used. Some of them are clearly related to wind energy production (like “100 metre U wind component”), others not so clearly (“Leaf area index, high vegetation”). Also, it is common practice to use the value of those variables, not just at the location of interest (Sotavento in this case), but at points in a grid around Sotavento. A 5x5 grid will be used in this case.
Therefore, each meteorological variable has been instantiated at 25 different locations (location 13 is actually Sotavento). That is why, for instance, attribute iews appears 25 times in the dataset (iews.1, iews.2, …, iews.13, …, iews.25). Therefore, the dataset contains 22*25 = 550 input attributes.

0) Preliminary operations

Introduce NAs, impute the NAs, divide the dataset into train and test partitions, and scale the data.
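These four steps might look like the following sketch. Synthetic data stands in for the real wind dataset, and the column names, NA fraction, imputation strategy, and split ratio are assumptions for illustration; note that the imputer and scaler are fitted on the training partition only, to avoid leaking test information.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (hypothetical column names)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(300, 10)),
                  columns=[f"x{i}" for i in range(10)])
df["energy"] = rng.normal(size=300)

X, y = df.drop(columns="energy"), df["energy"]

# 1) Introduce NAs at random in a small fraction of the cells
mask = rng.random(X.shape) < 0.05
X = X.mask(mask)

# 2) Divide the dataset into train and test partitions
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# 3) Fit the imputer and the scaler on the training partition only
imputer = SimpleImputer(strategy="median").fit(X_tr)
scaler = StandardScaler().fit(imputer.transform(X_tr))

# 4) Apply the same fitted transformations to both partitions
X_tr_prep = scaler.transform(imputer.transform(X_tr))
X_te_prep = scaler.transform(imputer.transform(X_te))
```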

1) Model selection and hyper-parameter tuning

Train KNN, Random Forest, and Gradient Boosting models with and without hyper-parameter tuning.
Both BayesSearch and Optuna can be used (but Optuna will be graded better). If you use advanced implementations of Gradient Boosting such as XGBoost or LightGBM, your work will get higher grades.
Compare all of them using the test partition, so that we can conclude which is the best method. Also, compare the results with those of trivial (dummy) models.
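A trivial baseline of the kind mentioned above can be built with scikit-learn's DummyRegressor, which always predicts the training mean; any useful model should beat its test MAE. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the preprocessed wind dataset
rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(200, 5)), rng.normal(size=200)
X_te, y_te = rng.normal(size=(50, 5)), rng.normal(size=50)

# Trivial model: always predict the mean of the training targets
dummy = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
baseline_mae = mean_absolute_error(y_te, dummy.predict(X_te))
print(f"Dummy baseline MAE: {baseline_mae:.3f}")
```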

1a) Train a KNN model with default hyper-parameters
As a performance measure, the MAE (Mean Absolute Error) is used. Unlike the (root) mean squared error, the MAE does not give extra weight to instances with large errors.
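This step can be sketched as follows (synthetic data stands in for the preprocessed wind dataset; the defaults shown in the comment are scikit-learn's):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic regression problem standing in for the real data
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(200, 2))
y_tr = X_tr[:, 0] + rng.normal(scale=0.1, size=200)
X_te = rng.normal(size=(50, 2))
y_te = X_te[:, 0] + rng.normal(scale=0.1, size=50)

# Default hyper-parameters: n_neighbors=5, weights="uniform", p=2
knn = KNeighborsRegressor().fit(X_tr, y_tr)
mae = mean_absolute_error(y_te, knn.predict(X_te))
print(f"KNN test MAE: {mae:.3f}")
```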

1b) Train a KNN model with hyper-parameter tuning

The hyper-parameters that have been considered are:

The budget for the inner evaluation is set to 10.
The best parameters are then used for the outer evaluation.

The graphs above show contour plots for the different combinations of hyper-parameters.
The n_neighbors vs weights plot shows that the function has one lighter (lower-error) area, for n_neighbors greater than 20.

There is no clear trend in this graph, because the function has many local optima.

n_neighbors is by far the most significant hyper-parameter, followed by p.

In general, p=1 seems better than p=2, although there is an outlier, and the effect of weights="distance" is not very clear.
The most interesting values of n_neighbors are the ones between 20 and 40.

1c) Train a Random Forest model with default hyper-parameters

1d) Train a Random Forest model with hyper-parameter tuning

The hyper-parameters that have been considered are:

The budget for the inner evaluation is set to 15.
The best parameters are then used for the outer evaluation.

The contour plots clearly show the shape of the objective function for each combination of hyper-parameters.
The lighter portions of the graphs correspond to the lowest objective function values.

The parameters n_estimators and max_depth are by far the most significant of the four.

The objective function has lower values for max_depth close to 30 and min_samples_leaf close to 10. The n_estimators plot suggests that the objective decreases as n_estimators increases.

1e) Train a Gradient Boosting model with default hyper-parameters

1f) Train a Gradient Boosting model with hyper-parameter tuning

The hyper-parameters that have been considered are:

The budget for the inner evaluation is set to 20.
The best parameters are then used for the outer evaluation.

In the most interesting parts of the contour plots, the region corresponding to the lowest values is quite small.

n_estimators and max_depth are more or less equally important, while learning_rate is the most important of the three.

The patterns in the three scatterplots are not very clear. Nevertheless, the most interesting areas seem to be:

1g) Train a Light Gradient Boosting model with hyper-parameter tuning

Unlike the previous contour plots, the white portions here are much more predominant, so the region around the optimum is "flatter".

max_depth is not so important in this case.

The most interesting plot is the one for the learning rate, which takes good values between 0.1 and 0.2.
Moreover, the num_leaves plot shows a roughly negative "linear" relationship, and the most interesting values are close to 10.

1h) Train a CatBoost model with hyper-parameter tuning

The outer evaluation was extremely slow, so the kernel had to be stopped.

The errors and elapsed times of the methods seen so far are compared below:

The best model among the ones analyzed is the Random Forest with hyper-parameter tuning. The LightGBM model with hyper-parameter tuning is interesting as well: its result is quite good considering the huge speed of the algorithm.
In general, the K-Nearest Neighbors methods are the worst, and the ensemble methods fall in between.
Regarding elapsed time, the Random Forests take the most time, while XGBoost seems a good compromise between performance and elapsed time.

2) Attribute selection

Please note: in order to answer these questions, you should read the Attribute Selection notebook and understand the main ideas behind SelectKBest and Pipeline. Use SelectKBest and Pipeline (and whatever else you need) to find a subset of attributes that allows building an accurate Decision Tree model. Use the best model of the previous section, but if it is too slow, use Decision Trees (they're faster).
Use the test partition in order to compare different models.

To find the optimal number of attributes, SelectKBest is used inside a pipeline that also contains the regressor. The pipeline itself then behaves as a single estimator.
In this case, the Random Forest Regressor is chosen, but to speed up processing, only one of its hyper-parameters is tuned (regression__max_depth).
For the same reason, the range for the parameter select__k is restricted to multiples of 25 up to 200.

Let's now plot the scores of each attribute, to check which are the most important attributes of the dataset.
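The per-attribute scores can be read from the scores_ attribute of a fitted SelectKBest object, for instance (synthetic data; the attribute names are hypothetical):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in: only the 4th attribute is informative
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 3] + rng.normal(scale=0.1, size=200)
names = [f"var.{i+1}" for i in range(20)]  # hypothetical attribute names

selector = SelectKBest(score_func=f_regression).fit(X, y)
# Rank the attributes by their univariate score, highest first
ranking = sorted(zip(names, selector.scores_), key=lambda t: -t[1])
for name, score in ranking[:5]:
    print(f"{name}: {score:.1f}")
```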

It can be seen that the most important attributes are not the ones measured at Sotavento itself.
Indeed, the highest-scoring Sotavento attribute is not even among the top 10!
This shows that it is important to consider the surrounding grid points, not only the location of real interest.

The graph also shows that there is a big drop in the score after the 25th attribute.